{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "if not any(path.endswith('textbook') for path in sys.path):\n",
    "    sys.path.append(os.path.abspath('../../..'))\n",
    "from textbook_utils import *"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(ch:files_command_line)=\n",
    "# The Shell and Command-Line Tools\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Nearly all computers provide access to a *shell interpreter*, such as `sh` or `bash` or `zsh`.  These interpreters typically perform operations on the files on a computer with their own language, syntax, and built-in commands.\n",
    "\n",
    "We use the term *command-line interface (CLI) tools* to refer to the commands available in a shell interpreter. Although we only cover a few CLI tools here, there are many useful CLI tools that enable all sorts of operations on files. For instance, the following command in the `bash` shell\n",
    "produces a list of all the files in the _figures/_ folder for this chapter along with their file sizes:\n",
    "\n",
    "```bash\n",
    "# The dollar sign is the shell prompt, showing the user where to type. It's\n",
    "# not part of the command itself.\n",
    "$ ls -l -h figures/\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The basic syntax for a shell command is:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```bash\n",
    " command -options arg1 arg2\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "CLI tools often take one or more *arguments*, similar to how Python functions\n",
    "take arguments.\n",
    "In the shell, we wrap arguments with spaces, not with\n",
    "parentheses or commas.\n",
    "The arguments appear at the end of the command line, and they are\n",
    "usually the name of a file or some text. In the `ls` example, the\n",
    "argument to `ls` is `figures/`. Additionally, CLI tools support *flags* that\n",
    "provide additional options. These flags are specified immediately following the\n",
    "command name using a dash as a delimiter. In the `ls` example, we\n",
    "provided the flags `-l` (to provide extra information about each file) and `-h`\n",
    "(to provide file sizes in a more human-readable format). Many commands have\n",
    "default arguments and options, and the `man` tool prints a list of\n",
    "acceptable options, examples, and defaults for any command.\n",
    "For example, `man ls` describes the 30 or so flags available for `ls`."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":::{note}\n",
    "\n",
    "All CLI tools we cover in this book are specific to the `sh` shell\n",
    "interpreter, the default interpreter for Jupyter installations on macos and\n",
    "Linux systems at the time of this writing. Windows systems have a different\n",
    "interpreter, and the commands shown in the book may not run on Windows, although\n",
    "Windows gives access to an `sh` interpreter through its Linux Subsystem.\n",
    "\n",
    "The commands in this section can be run in a terminal application, or through\n",
    "a terminal opened by Jupyter.\n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We begin with an exploration of the filesystem containing the content for this chapter, using the `ls` tool:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "$ ls\n",
    "\n",
    "data                            wrangling_granularity.ipynb\n",
    "figures                         wrangling_intro.ipynb                      \n",
    "wrangling_command_line.ipynb    wrangling_structure.ipynb\n",
    "wrangling_datasets.ipynb        wrangling_summary.ipynb\n",
    "wrangling_formats.ipynb       \n",
    "\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To dive deeper and list the files in the _data/_ directory, we provide the directory name as an argument to `ls`:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "$ ls -l -L -h data/\n",
    "\n",
    "total 556664\n",
    "-rw-r--r--  1 nolan  staff   267M Dec 10 14:03 DAWN-Data.txt\n",
    "-rw-r--r--  1 nolan  staff   645K Dec 10 14:01 businesses.csv\n",
    "-rw-r--r--  1 nolan  staff    50K Jan 22 13:09 co2_mm_mlo.txt\n",
    "-rw-r--r--  1 nolan  staff   455K Dec 10 14:01 inspections.csv\n",
    "-rw-r--r--  1 nolan  staff   120B Dec 10 14:01 legend.csv\n",
    "-rw-r--r--  1 nolan  staff   3.6M Dec 10 14:01 violations.csv\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We added the `-l` flag to the command to get more information about each file.\n",
    "The file size appears in the fifth column of the listing, and it's more readable as specified by the `-h` flag.\n",
    "When we have multiple simple option flags like `-l`, `-h`, and `-L`, we can combine them together as a shorthand:\n",
    "```\n",
    "ls -lLh data/\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":::{note}\n",
    "\n",
    "When working with datasets in this book, our code will often use an additional `-L` flag for `ls` and other CLI tools, such as `du`. We do this because we set up the datasets in the book using shortcuts (called _symlinks_). Usually, your code won't need the `-L` flag unless you're working with symlinks too. \n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other CLI tools for checking file size, are `wc` and `du`. The command `wc` (short for _word count_) provides helpful information about a file's size in terms of the number of lines, words, and characters in the file:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "$ wc data/DAWN-Data.txt\n",
    "\n",
    "  229211 22695570 280095842 data/DAWN-Data.txt\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see from the output that _DAWN-Data.txt_ has 229,211 lines and 280,095,842 characters. (The middle value is the file's word count, which is useful for files that contain sentences and paragraphs bu not very useful for files containing data, such as FWF formatted values.)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `ls` tool does not calculate the cumulative size of the contents of a folder. To properly calculate the total size of a folder, including the files in the folder, we use `du` (short for _disk usage_). By default, the `du` tool shows the size in units called _blocks_:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "$ du -L data/\n",
    "\n",
    "556664\tdata/\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We commonly add the `-s` flag to `du` to show the file sizes for both files and folders and the `-h` flag to display quantities in the standard\n",
    "KiB, MiB, or GiB format. The asterisk in `data/*` in the following code tells `du` to show the size of every item in the_data_ folder:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "$ du -Lsh data/*\n",
    "\n",
    "267M\tdata/DAWN-Data.txt\n",
    "648K\tdata/businesses.csv\n",
    " 52K\tdata/co2_mm_mlo.txt\n",
    "456K\tdata/inspections.csv\n",
    "4.0K\tdata/legend.csv\n",
    "3.6M\tdata/violations.csv\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To check the formatting of a file, we can examine the first few lines with the `head` command or the last few lines with `tail`. These CLIs are very useful for peeking at a\n",
    "file's contents to determine whether it's formatted as CSV, TSV, and so on. As an example, let's\n",
    "look at the _inspections.csv_ file:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "$ head -4 data/inspections.csv\n",
    "\n",
    "\"business_id\",\"score\",\"date\",\"type\"\n",
    "19,\"94\",\"20160513\",\"routine\"\n",
    "19,\"94\",\"20171211\",\"routine\"\n",
    "24,\"98\",\"20171101\",\"routine\"\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, `head` displays the first 10 lines of a file. If we want to show,\n",
    "say, four lines, then we add the option `-n 4` to our command\n",
    "(or just `-4` for short).\n",
    "\n",
    "We can print the entire contents of the file using the `cat` command. However, you\n",
    "should take care when using this command, as printing a large file can cause a crash.\n",
    "The _legend.csv_ file is small, and we can use `cat` to concatenate and print its contents:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "$ cat data/legend.csv\n",
    "\n",
    "\"Minimum_Score\",\"Maximum_Score\",\"Description\"\n",
    "0,70,\"Poor\"\n",
    "71,85,\"Needs Improvement\"\n",
    "86,90,\"Adequate\"\n",
    "91,100,\"Good\"\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In many cases, using `head` or `tail` alone gives us a good enough sense of the file structure to proceed with loading it into a dataframe."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, the `file` command can help us determine a file's encoding:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "$ file -I data/*\n",
    "\n",
    "data/DAWN-Data.txt:   text/plain; charset=us-ascii\n",
    "data/businesses.csv:  application/csv; charset=iso-8859-1\n",
    "data/co2_mm_mlo.txt:  text/plain; charset=us-ascii\n",
    "data/inspections.csv: application/csv; charset=us-ascii\n",
    "data/legend.csv:      application/csv; charset=us-ascii\n",
    "data/violations.csv:  application/csv; charset=us-ascii\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see (again) that all of the files are ASCII, except for _businesses.csv_, which has an\n",
    "ISO-8859-1 encoding."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":::{note} \n",
    "\n",
    "Commonly, we open a terminal program to start a shell interpreter. However,\n",
    "Jupyter notebooks provide a convenience: if a line of code in a Python code\n",
    "cell is prefixed with the `!` character, the line will go directly to the\n",
    "system’s shell interpreter. For example, running `!ls` in a Python cell lists\n",
    "the files in the current directory.\n",
    "\n",
    ":::"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Shell commands give us a programmatic way to work with files, rather than a\n",
    "point-and-click \"manual\" approach. They are useful for the following:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Documentation\n",
    ": If you need to record what you did.\n",
    "\n",
    "Error reduction\n",
    ": If you want to reduce typographical errors and other simple but potentially harmful mistakes.\n",
    "\n",
    "Reproducibility\n",
    ": If you need to repeat the same process in the future or you\n",
    "  plan to share your process with others. This gives you a record of your actions.\n",
    "  \n",
    "Volume\n",
    ": If you have many repetitive operations to perform, the size of the file you are working with is large, or you need to perform things quickly. CLI tools can help in all these cases."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the data have been loaded into a dataframe, our next task is to figure\n",
    "out the table's shape and granularity. We start by finding the number of rows and columns in the table (its shape). The we need to understand what a row represents before we begin to\n",
    "check the quality of the data. We cover these topics in the next section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
